Ensembling U-Nets for microaneurysm segmentation in optical coherence tomography angiography in patients with diabetic retinopathy

Diabetic retinopathy is one of the leading causes of blindness around the world. This makes early diagnosis and treatment important in preventing vision loss in a large number of patients. Microaneurysms are the key hallmark of the early stage of the disease, non-proliferative diabetic retinopathy, and can be detected using OCT angiography quickly and non-invasively. Screening tools for non-proliferative diabetic retinopathy using OCT angiography thus have the potential to lead to improved outcomes in patients. We compared different configurations of ensembled U-nets to automatically segment microaneurysms from OCT angiography fundus projections. For this purpose, we created a new database to train and evaluate the U-nets, which was labeled by two expert graders in two stages of grading. We present the first U-net neural networks using ensembling for the detection of microaneurysms from OCT angiography en face images from the superficial and deep capillary plexuses in patients with non-proliferative diabetic retinopathy, trained on a database labeled by two experts with repeats.

Diabetic retinopathy (DR) is one of the leading causes of blindness among working-age individuals worldwide. It consists of two stages: an earlier non-proliferative stage (NPDR) and a more advanced proliferative stage (PDR), which occurs when new retinal blood vessels form ('proliferate'), often in response to retinal tissue ischemia. During the earlier NPDR stage, patients may be asymptomatic; however, microaneurysms (MAs), the hallmark of this stage, already emerge as outpouchings of the retinal blood vessels that are weakened as a result of the sugar overload in the blood [1][2][3][4].
NPDR can be graded as mild, moderate, or severe. MAs are an early and important clinical sign of disease progression and a main component of classifying DR severity. Early diagnosis of DR is key for treatment and for preserving patient vision, since it can prevent blindness in more than 90% of patients [1].
Fluorescein angiography (FA) is currently the gold standard for the diagnosis of DR and the most sensitive test for detecting MAs. However, it suffers from several drawbacks. During FA imaging, fluorescein, a contrast agent, is injected to highlight the patient's retinal vasculature [3]. In rare cases, fluorescein can lead to anaphylactic shock in patients who are allergic to it, a reaction that can be fatal if urgent medical intervention is not taken [2]. This makes FA invasive, costly, and time consuming. Furthermore, superposition of retinal capillary layers and leakage pose a challenge to FA, while the deep capillary plexus is barely visible in FA [2,5]. The combination of these factors makes FA less suitable as a screening tool for DR, pushing scientists and engineers to find complementary imaging modalities such as optical coherence tomography angiography (OCTA) [3,6]. OCTA allows the separation of the superficial (SCP) and deep capillary plexuses (DCP) and does not require injections of a contrast agent. On the other hand, OCTA does not show all MAs visible with FA, as the speed of blood flow within certain MAs is below the threshold of OCTA detection [2].

Expert labeled database
Training of the network requires a training data set with accurate annotations. To the best of our knowledge, there is currently no publicly available data set of OCTA scans of patients with NPDR/PDR and annotated MAs. We therefore created a suitable data set ourselves, which was labeled by two expert graders from the New England Eye Center at Tufts Medical Center in Boston.
A total of 119 eyes of 70 patients diagnosed with early, intermediate, or severe NPDR or PDR were included in this study. Data were collected from a Zeiss Plex Elite 9000 SS-OCT device with dual-speed 100 kHz and 200 kHz A-scan rates, a lateral resolution of ≤ 20 micrometers, and an axial resolution of 6.3 micrometers. All OCTA images had a signal strength of 6 or greater as indicated by the system's software and were qualitatively screened for overall quality and excessive artifacts. The field size of all scans is 6 × 6 mm.
The data were split into 96 eyes (from 52 patients) for training and 23 eyes (from 18 patients) for testing. The system software was used to segment the SCP and DCP of the OCTA scans and to generate en face projections. Table 1 shows the number of patients and diagnosis for the test and training data split.
Our two-stage approach for creating an expert labeled database of MAs from the SCP and DCP layers is similar to the one by Bertram et al. [41]. For the first stage, the two expert graders labeled MAs in the en face projections as shown in Fig. 1. To create the expert labeled database, the graders used the open source web-based labeling tool EXACT to label MAs in both layers [42]. Each MA was annotated by creating a bounding box containing it on the en face projections. Even though MAs were annotated separately in the SCP and DCP, the presented method uses 2D images and 2D convolutions and not volumetric data. Each eye is represented by 2D fundus images, with the SCP and the DCP in separate channels. First, the experts labeled MAs in the en face projections independently of each other by reviewing each en face projection image twice. MAs were identified by the experts by examining the available OCTA en face images themselves.
In the next step, the experts had to come to an agreement on each MA label, which was necessary for it to become part of the expert labeled database. Only the bounding boxes on which both experts agreed remained in the database. In order to illustrate the challenge of finding and annotating all MAs, we computed the Pearson correlation coefficient for the numbers of MAs per eye labeled by each grader before they had to come to an agreement. The number of MAs labeled by each grader on a given eye can serve as a surrogate for reader agreement. The Pearson correlation coefficient for the number of MAs labeled per eye is ≈ 0.21 with a p-value of ≈ 0.045. A correlation coefficient of 1.0 would indicate perfect agreement, while 0.0 indicates no correlation at all. The low value illustrates how difficult it is for readers to find and annotate MAs.
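The agreement surrogate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pearson_r` and the per-eye counts are hypothetical and are not the study data, and the p-value reported in the text is not computed here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances without the 1/n factor, which cancels out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical numbers of MAs labeled per eye by each grader (not study data).
grader_a = [12, 30, 5, 59, 8]
grader_b = [20, 10, 9, 25, 31]
print(f"r = {pearson_r(grader_a, grader_b):.2f}")
```

A value near 1.0 would mean that both graders tend to label many MAs on the same eyes; it does not show that they labeled the same MAs, which is why the text calls it a surrogate.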
The contents of the bounding boxes in the en face images were converted to training targets for the network via a thresholding step, with examples shown in Fig. 2. A threshold of 150 was applied to the areas enclosed by bounding boxes to generate binary labels for the MAs. This threshold applies to the value range of 0 to 255 of the en face images. In order to find this threshold, a small subset of randomly chosen MAs was used to find a value that preserved the area of the MAs after thresholding. It is possible that small groups of unconnected pixels not directly belonging to the MA remain, as seen in Fig. 2. This can be compensated for later by suppressing connected components below a given size in the network's output (details further below).
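As a minimal sketch of this target generation, the following Python function thresholds only the pixels inside the annotated bounding boxes. The function name, the box format `(row0, col0, row1, col1)` with exclusive end indices, and the choice of `>=` rather than `>` at the threshold are assumptions made for illustration; the source only states the threshold of 150 on a 0 to 255 range.

```python
def boxes_to_mask(image, boxes, threshold=150):
    """Create a binary MA mask from one en face layer and its bounding boxes.

    image: 2D list of intensities in the range 0..255.
    boxes: iterable of (row0, col0, row1, col1), end indices exclusive (assumed format).
    """
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for row0, col0, row1, col1 in boxes:
        # Clip the box to the image and threshold only the pixels inside it.
        for r in range(max(row0, 0), min(row1, height)):
            for c in range(max(col0, 0), min(col1, width)):
                if image[r][c] >= threshold:
                    mask[r][c] = 1
    return mask

# Tiny hypothetical example: bright vessel pixels outside the box stay background.
layer = [
    [200, 200, 10, 10],
    [200, 100, 10, 10],
    [10, 10, 160, 149],
    [10, 10, 150, 255],
]
target = boxes_to_mask(layer, [(2, 2, 4, 4)])
```

Restricting the threshold to the boxes is what keeps bright, healthy vessels out of the target, since they would also exceed the threshold.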
The OCTA en face projections were used as training input for the network, while the bounding box annotations were converted to binary ground truth images with per-pixel annotations of MAs for the network output. The process is shown in Fig. 3. The en face images served as input to the network, while the binary masks generated from the en face images and the bounding boxes were used as training targets. Both SCP and DCP en face projections were used as input for the network at the same time. The input used channel one for the SCP and channel two for the DCP.
The database was used for the first stage of training of the networks and to decide on the training parameters. After training this initial network on the training data set via fivefold cross-validation, the second stage of the database creation could proceed. The resulting false positives (FPs) and false negatives (FNs) were then reviewed by the expert graders again. Because of the small size of potential MAs and their potentially large numbers, it is a challenge for the graders to find all MAs. Reviewing candidates that were flagged by the neural network as false positives can help to identify MAs that had been overlooked by the experts before. Even though the number of MAs in a given eye can be substantial, the overall fraction of all pixels in all images that belong to a MA is relatively small: less than 1% of all pixels were labeled as belonging to a MA. The database resulting from this two-stage process was used for training the networks and for the results presented in the results and discussion section below.

In order to assess the quality of the labeled data set, we show the number of labeled MAs per eye in Fig. 4. The diagram indicates the number of MAs labeled per eye for the given disease severity. For example, the blue marker near 60 indicates that an eye with the diagnosis of mild NPDR contains 59 labeled MAs. The red markers indicate increasing numbers of MAs coinciding with disease progression. There is a drop from severe NPDR to PDR, however, which is likely related to laser treatment in patients. Furthermore, the graders annotated more MAs in the DCP than in the SCP. This is consistent with previous studies, which state that MAs occur more often in the DCP [4,43].

U-Net
We decided to use a U-net, first published by Ronneberger et al., to segment MAs due to its proven effectiveness for medical segmentation tasks [9]. It consists of a convolutional down-sampling branch which downsizes the image data while computing features using the filters learned during training. It is completed by an up-sampling branch in order to provide per-pixel labels that match the size of the input. The number of down-sampling steps depends on the size of the input images and of the structures to be segmented, while the intermediate feature maps from the down-sampling branch are also passed on to the up-sampling branch. This preserves spatial information that could otherwise be lost during subsequent down-sampling operations. The combination of these elements makes the U-net architecture a proven network design and candidate for the segmentation of MAs [9][23][24][25].
We use nnU-Net as a starting point for our U-net adaptation for MA segmentation. nnU-Net is a generalized toolbox that specializes in providing support for solving segmentation problems in biomedical imaging. It provides a U-net adapted automatically to the dimensions of the images to be trained on. It additionally provides a sane set of default settings and heuristic rules based on properties of the data set. nnU-Net differentiates between three sets of parameters. The first set comprises parameters that remain the same across all potential segmentation tasks, e.g., the U-net architecture, but also the optimizer and its learning rate, the number of epochs, the loss function, and the augmentations. The second set is rule-based and derived from the properties of the training data, e.g., intensity distribution, spacing of pixels, and modality (e.g., computed tomography). The third set of parameters is empirical. This means that nnU-Net can make certain choices based on post-processing. The advantage of nnU-Net is that it provides a deep learning pipeline that should lead to usable results without additional changes. Its defaults, however, leave room for changes and additional tuning to improve the results delivered by nnU-Net. Additionally, nnU-Net supports ensembling of trained networks. That is, if enough data are available for a train/test split, the five networks trained on each of the cross-validation folds can be used as an ensemble on the test data. For this, the outputs of the five nets are averaged. This can lead to an improvement in segmentation performance at the expense of increased training time. nnU-Net's architecture uses skip-connections to avoid over-fitting, a combination of dice and cross-entropy loss, leaky ReLUs as activation function, and deep supervision, and it uses stochastic gradient descent with Nesterov momentum for training [9,44].
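The averaging step of the ensemble can be illustrated with a short, self-contained sketch. It assumes that each network produces a per-pixel foreground probability map and that a decision threshold (0.5 here, chosen only for illustration) is applied after averaging; the function names are hypothetical and this is not nnU-Net's own implementation.

```python
def ensemble_average(probability_maps):
    """Average the per-pixel outputs of several networks (list of 2D lists)."""
    n = len(probability_maps)
    height = len(probability_maps[0])
    width = len(probability_maps[0][0])
    return [
        [sum(m[r][c] for m in probability_maps) / n for c in range(width)]
        for r in range(height)
    ]

def binarize(probability_map, decision_threshold=0.5):
    """Apply a decision threshold to an averaged probability map."""
    return [[1 if p >= decision_threshold else 0 for p in row]
            for row in probability_map]

# Hypothetical outputs of five fold-specific networks for a 1 x 2 pixel image.
fold_outputs = [
    [[0.9, 0.1]],
    [[0.8, 0.2]],
    [[0.7, 0.9]],
    [[0.6, 0.0]],
    [[1.0, 0.3]],
]
averaged = ensemble_average(fold_outputs)
segmentation = binarize(averaged)
```

In this toy example the second pixel is confidently detected by only one of the five networks, so its averaged value stays below the threshold. Averaging therefore tends to keep detections on which the folds agree.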
Due to the imbalance of the expert labeled database (less than 1% of pixels belong to a MA), we decided to investigate focal loss and dice loss and compared them with the default nnU-Net configuration [45]. We also added a comparison with TransUNet and Swin-Unet, which are two state-of-the-art U-net implementations. TransUNet adds transformers and pre-trained weights to the U-net architecture [46], while Swin-Unet implements a transformer-based U-shaped encoder-decoder architecture with skip-connections for local-global semantic feature learning [47]. Additionally, we suppressed connected components with a width or height of less than 11 pixels to reduce the number of false positives. All configurations were trained with a learning rate of 0.1.
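The suppression of small connected components can be sketched as follows. The text specifies only the size criterion (width or height below 11 pixels); the use of 8-connectivity, the bounding-box definition of width and height, and the function name are assumptions made for this illustration.

```python
from collections import deque

def suppress_small_components(mask, min_size=11):
    """Remove connected components whose bounding box is narrower or
    shorter than min_size pixels. mask is a 2D list of 0/1 values."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood fill one component (8-connectivity is an assumption).
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            rows = [p[0] for p in pixels]
            cols = [p[1] for p in pixels]
            box_height = max(rows) - min(rows) + 1
            box_width = max(cols) - min(cols) + 1
            # Keep the component only if both extents reach min_size.
            if box_height >= min_size and box_width >= min_size:
                for y, x in pixels:
                    out[y][x] = 1
    return out
```

Applied to a thresholded network output, this keeps an 11 × 11 blob but removes, for example, a 3 × 3 speck or a 12 × 5 streak, which trades a few missed small MAs for fewer false positives.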

Results and discussion
We provide both per-pixel and per-MA metrics as part of the evaluation. The per-pixel metrics show how many pixels are classified correctly as belonging to a MA or not, while the per-MA metrics indicate whether a MA was picked up by a net or whether the net detected a false positive MA. Even though the per-pixel metrics help to understand the overall results, we consider the per-MA metrics to be the more clinically relevant ones. Furthermore, we have added comparisons with TransUNet and Swin-Unet [3,46]. Both network architectures serve as a point of reference for the changes we have made to nnU-Net.
Overall, we compare three U-Net configurations, TransUNet, and Swin-Unet:
• the original nnU-Net configuration,
• a new configuration using dice loss,
• a new configuration using focal loss,
• TransUNet, which is a state-of-the-art implementation of the U-net architecture adding transformers and pre-trained weights, and
• Swin-Unet, which is a state-of-the-art implementation of the U-net architecture adding a transformer-based U-shaped encoder-decoder architecture with skip-connections for local-global semantic feature learning.
Since FA is the gold standard for the diagnosis of DR and MAs, it seems self-evident to use FA images for the evaluation of any automated detection algorithm. The challenge to this approach lies in the dynamic nature of MAs themselves. The number of MAs can vary from visit to visit [3]. Both OCTA scans and FA images would need to be acquired during the same visit. Due to the difficulty of obtaining OCTA scans and FA images from the same visit, we rely on a comparison to state-of-the-art networks instead. We list precision/recall and associated metrics (number of true positives, false negatives, false positives, F1-score) for each configuration. For per-pixel results we provide area-under-curve (AUC) and precision/recall metrics.
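For reference, the listed metrics follow from the counts of true positives, false positives, and false negatives as sketched below. The counts in the example are hypothetical and not taken from the tables, and returning 0.0 for an undefined ratio is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from detection counts."""
    # Precision: fraction of detections that are real MAs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of annotated MAs that were detected.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1-score: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical per-MA counts: 8 MAs found, 2 spurious detections, 8 MAs missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

The same function applies to per-pixel counts; only the unit being counted changes from whole MAs to individual pixels.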
Figure 5 shows precision/recall curves using fivefold cross-validation on the training data over the decision thresholds. Table 2 shows results for the same data at different decision thresholds. Figure 6 and Table 3 show results on the test data using an ensemble of the five U-nets, five TransUNets, and five Swin-Unets trained on each of the five folds of the training data.
First, we consider the fivefold cross-validation results on the training data. For each nnU-Net configuration, including the default nnU-Net and our adaptations with dice loss and focal loss, a single network was trained on each fold. TransUNet and Swin-Unet were also trained once for each of the five folds. Figure 5 and Table 2 show these results.
First of all, it is apparent that the curves in Fig. 5 for the default nnU-Net and the dice loss version behave similarly due to nnU-Net's loss being a combination of dice loss and cross-entropy loss. The precision is slightly lower for dice loss, but the recall is better for dice loss when compared to nnU-Net. This does not come as a surprise considering the class imbalance in the data set and the fact that cross-entropy loss does not perform well on imbalanced data sets without compensating features such as sample weights. The precision of TransUNet is also higher when compared to the dice loss configuration, but the recall is worse for lower thresholds and slightly better for higher thresholds. This extends to the F1-scores. Focal loss, on the other hand, achieves the highest precision. It also displays the highest recall at low thresholds, but this coincides with very low precision. Precision across all networks is noticeably better in the DCP when compared to the SCP. The opposite applies to the recall across all networks: it is generally higher in the SCP when compared to the DCP. Swin-Unet, however, consistently shows worse precision and recall when compared to the other networks.
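The effect of class imbalance on the loss functions can be made concrete with a toy calculation. The sketch below uses textbook binary formulations of dice, cross-entropy, and focal loss; it is not nnU-Net's implementation, and the focusing parameter `gamma=2.0` as well as the example data are assumptions, since the text does not state them.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft dice loss: 1 - 2|P∩T| / (|P| + |T|)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)

def cross_entropy_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean focal loss; gamma down-weights pixels that are already easy."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if t == 1 else 1.0 - p  # probability assigned to the true class
        total -= (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

# Hypothetical image with 1000 pixels, 5 of which belong to a MA (0.5%),
# and a network that predicts "background" almost everywhere.
targets = [1] * 5 + [0] * 995
all_background = [0.01] * 1000
print("cross-entropy:", cross_entropy_loss(all_background, targets))
print("dice loss:    ", dice_loss(all_background, targets))
print("focal loss:   ", focal_loss(all_background, targets))
```

In this example the mean cross-entropy is small even though every MA is missed, because the many background pixels dominate the average, whereas the dice loss stays close to its maximum of 1. This is one way to see why a dice-only configuration may favor recall on such data.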
Next, we evaluate the results of the ensembled networks on the test data in Fig. 6 and Table 3. Several of the previous observations from the results on the fivefold cross-validation data still hold true. Precision for the dice loss configuration is slightly worse than for the nnU-Net configuration. The precision of TransUNet is higher than that of both the dice loss and default nnU-Net configurations. Again, the focal loss configuration performs best in the lower decision threshold ranges, but the F1-score is in the same range as the other configurations. The precision across all networks is slightly better in the DCP when compared to the SCP. Recall decreases in the DCP when compared to the SCP, except for the dice loss configuration. Interestingly, it appears that the dice loss network benefits from the ensembling of networks, which is a notable exception to the other networks. When considering the F1-scores on the SCP, dice loss and TransUNet show very similar performances overall, with the dice loss performing slightly better. This changes in the case of the DCP, however, with the dice loss' improved recall also improving its F1-score. Swin-Unet shows improved precision when used with ensembling on the test data; its recall, however, does not notably improve.
When comparing the results for the cross-validation evaluation on the training data in Table 2 with the ensembled results on the test data in Table 3, it becomes clear that precision improves across all tested configurations for the ensembled networks. Recall, however, increases for lower thresholds while it decreases for higher thresholds, with the exception of dice loss in the DCP. Generally, a decrease in recall is unfortunate for use cases such as screening, where high recall (i.e., finding every possible case of the condition) is preferred over high precision (i.e., avoiding false positives) to ensure that as few cases as possible are missed. This is because missing a true case (a false negative) can have more severe consequences than incorrectly identifying a case that is not there (a false positive), which can usually be ruled out with further testing.
For both sets of results, the fivefold cross-validation on the training data and ensembling on the test data, it is apparent that precision is higher in the DCP. We mainly attribute this to the difference in vascular morphology between the two layers. OCTA scans of the SCP show clear and continuous vessel shapes against a black background, while the DCP shows a greater similarity to a regular distribution and small complex interconnections [48]. This can be observed in Figs. 7 and 8. Also, for both sets of results, recall for the DCP decreases when compared to the SCP. Even though fewer FPs in the DCP benefit precision, we theorize that the larger number of annotated MAs in the DCP leads to slightly fewer of them being found, thus inhibiting recall. On the training data set, 1094 MAs were annotated in the SCP, while 2028 MAs were annotated in the DCP, almost twice as many. On the test data set, 313 MAs were annotated in the SCP, while 534 MAs were annotated in the DCP. This is congruent with the clinical observations in DR, where the majority of MAs tend to occur in the DCP, not the SCP [4,43]. A somewhat reduced recall in the DCP can be compensated for by the larger number of MAs in that layer, as long as the recall does not sink too closely towards 0 (see Tables 2 and 3). For instance, in case of the dice loss on [...]

The ensembling step works by running the prediction of the five networks, each trained on a different fold of the training data, on the test data. The five predictions for each eye are then averaged. Dice loss in the DCP benefits from this step disproportionately when compared to the other networks and their losses and compared to the SCP. In the case of TransUNet, for instance, it is possible that each of the five instances finds different subsets of MAs in the DCP, but those fall below the size and decision threshold when ensembled. Dice loss on the DCP, by comparison, performs better in this instance due to a combination of the tendency to favor contained areas with clearly delineated outlines and its resilience toward class imbalance. This is illustrated in supplementary Fig. F1, which shows the network output for patients 1 and 2 shown in Figs. 7 and 8, respectively.
Overall, nnU-Net's default configuration, the dice loss configuration, and TransUNet behave very similarly due to nnU-Net's and TransUNet's loss being a combination of dice and cross-entropy loss. This can be seen in Figs. 5 and 6. The fact that the dice loss configuration achieves a better recall than nnU-Net does not come as a surprise considering the class imbalance in the data set and the fact that cross-entropy loss does not perform well on imbalanced data sets without compensating features such as sample weights.

Figure 7 shows en face projections of the SCP and DCP from a patient's eye with PDR and macular edema. A true positive from the SCP using dice loss is enlarged in the lower left. This is a large MA that was found by all five neural networks. A false negative MA from the DCP is shown in the bottom center left. Even though this is an annotated MA, it has only been found by the U-net ensemble using the dice loss configuration. The lower center right shows a false positive that has been found by the default nnU-Net configuration in the SCP. The enlarged area shows a potential vascular anomaly that could be an MA but was not labeled. The lower right shows an MA from the SCP that was only found by the TransUNet ensemble.

Figure 8 shows en face projections of the SCP and DCP from another patient's eye with PDR and macular edema. A false positive from the SCP using dice loss is enlarged in the lower left. A true positive MA from the DCP is shown in the bottom center left. This MA has been found by the dice and focal loss nnU-Net configurations but not by the default nnU-Net. The lower center right shows a false negative that has not been found by the default nnU-Net configuration in the DCP. This MA could be found using the dice loss configuration. The lower right shows another MA that was only found by the TransUNet ensemble.

Conclusion and outlook
In this paper we present two things. First, we created a data set of MAs on the SCP and DCP OCTA projections from patients with DR for the training and evaluation of U-nets, labeled by two expert graders in two rounds of labeling. Second, we present different U-net configurations designed to detect MAs in en face projections of the SCP and DCP from OCTA scans of patients with DR and compare them with TransUNet and Swin-Unet. Our results demonstrate that it is possible to detect MAs with high accuracy/specificity, albeit at the cost of recall/sensitivity. Even though higher recall is preferable in a clinical screening scenario, recall never reaches zero in the case of the presented dice loss configuration. The performance of the networks is generally comparable for application on the SCP and DCP, with the former benefiting from higher recall and the latter from slightly higher precision. The dice loss configuration is also the only network that benefited from ensembling in the DCP due to its resilience toward class imbalance and its ability to highlight clearly delineated areas. Overall, we demonstrate the viability of the U-net architecture for the segmentation of MAs in both the SCP and DCP in patients with DR. Using markers that are recognizable avoids the "black box" problem commonly associated with deep learning and allows clinicians to evaluate and trace the diagnosis made by the system.

Fig. 1. Labeling workflow of the two expert graders during the first stage based on Bertram et al. [41]. Each expert reviews the available en face projections (superficial and deep capillary plexuses) of every eye twice, independently of the other expert. Afterwards they review the data together and need to come to an agreement on each MA. The resulting expert labeled database is used for initial training. The resulting false positives and false negatives were used by the graders during the second stage of the labeled database creation.

Fig. 2. Example of how MA areas are converted to binary labels. The top row shows three different examples of MAs within bounding boxes. The values range from 0 (black) to 255 (white). The bottom row shows the corresponding binary masks for MAs after applying a binary threshold of 150. The bright areas indicate part of a MA, the dark areas indicate background.

Fig. 3. Ground truth target generation from the expert labeled database. The database contains the fundus projections and MA bounding boxes created by the expert labelers (red rectangles). SCP and DCP are used as two-channel network input. The areas within the bounding boxes have a threshold applied to them and serve as binary targets for the network, with one channel representing the SCP and the other the DCP.

Fig. 4. Number of labeled MAs per eye categorized by mild, moderate, and severe NPDR and PDR. The blue markers indicate the number of MAs on an eye of the given DR severity. The red markers indicate the mean number of MAs for each severity.

Fig. 5. From left to right: precision, recall, and F1-score curves over five training folds on the training data. The top row shows the results across both layers, with results for superficial and deep capillary plexuses shown separately below. Each neural network's results are shown in a different color.

Fig. 6. From left to right: precision, recall, and F1-score curves using ensembling on the test data. The top row shows the results across both layers, with results for superficial and deep capillary plexuses shown separately below. Each neural network's results are shown in a different color.

Table 2. Results for fivefold cross-validation on the training data. Results for the three tested configurations (nnU-Net, dice, and focal loss) and TransUNet and Swin-Unet are shown both per MA and per pixel. Numbers shown are the combined results across all five folds. The best result in each column is marked in bold.

Fig. 7. Patient 1: MA segmentation results on superficial (top row) and deep capillary plexuses (middle row) of an eye with PDR using dice loss, focal loss, default nnU-Net, TransUNet, and Swin-Unet ensembles. Green boxes indicate MAs that were correctly identified by the U-net ensemble. Red boxes indicate false negatives and orange boxes indicate false positives. The areas with dashed outlines are shown enlarged in the bottom row.

Fig. 8. Patient 2: MA segmentation results on superficial (top row) and deep capillary plexuses (middle row) of an eye with PDR using dice loss, focal loss, default nnU-Net, TransUNet, and Swin-Unet ensembles. Green boxes indicate MAs that were correctly identified by the U-net ensemble. Red boxes indicate false negatives and orange boxes indicate false positives. The areas with dashed outlines are shown enlarged in the bottom row.

Table 1. Number of eyes by disease stage in the test and training data sets. The separation into test and training data sets was randomized.

Table 3. Results for ensembled classification on the test data. Results for the three tested configurations (nnU-Net, dice, and focal loss) and TransUNet and Swin-Unet are shown both per MA and per pixel. The best result in each column is marked in bold.